
refactor(parser): merge lexer with parser & optimize parser #679

Merged: 18 commits into aurelia:master from new-parser, May 5, 2018

Conversation

@fkleuver (Member) commented Apr 24, 2018:

This is a rewrite of the parser that replaces all string-based token comparisons with bitwise operators to speed things up, and also combines the lexer into the parser.
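
To make "bitwise tokens" concrete, here is a minimal sketch with a made-up bit layout (not the PR's actual constants) of how an integer token can encode its category and precedence, so comparisons become single mask operations instead of string equality checks:

```js
// Hypothetical bit layout, for illustration only.
const T_PrecedenceShift = 6;
const T_BinaryOp = 1 << 11; // category flag bit

// Middle bits carry precedence, low bits a unique id.
const T_Add      = T_BinaryOp | (6 << T_PrecedenceShift) | 0;
const T_Multiply = T_BinaryOp | (7 << T_PrecedenceShift) | 1;

// One bitwise AND instead of comparing against a list of operator strings.
function isBinaryOp(token) {
  return (token & T_BinaryOp) === T_BinaryOp;
}

// Precedence comes out of the token itself, no table lookup needed.
function precedenceOf(token) {
  return (token >> T_PrecedenceShift) & 0b11111;
}
```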

Parsing binary expressions is flattened into a single recursive function that takes operator precedence into account, drastically reducing unnecessary nested calls.
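
And a rough sketch of the flattening itself: one precedence-climbing function replaces a fixed chain of parseEquality, parseRelational, parseAdditive, etc. calls. It reuses isBinaryOp/precedenceOf from the sketch above; parseUnary, nextToken, and operatorTextOf are assumed stand-ins for the real scanner internals, and Binary is the existing AST node.

```js
// Parse a run of binary operators at or above minPrecedence, folding the
// operands into left-associative Binary nodes as we go.
function parseBinary(state, left, minPrecedence) {
  while (isBinaryOp(state.currentToken)) {
    const opToken = state.currentToken;
    const precedence = precedenceOf(opToken);
    if (precedence < minPrecedence) {
      break; // let an outer call consume lower-precedence operators
    }
    nextToken(state);
    // The right operand first absorbs any tighter-binding operators.
    const right = parseBinary(state, parseUnary(state), precedence + 1);
    left = new Binary(operatorTextOf(opToken), left, right);
  }
  return left;
}
```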

Functionality should be identical afaik. Error checking/reporting might not be; I still need to check various error scenarios to ensure the strictness is the same. I wanted to gather some feedback first before I spend a bunch of time on that though :)

Performance improvement is around 200%, depending on the type of expressions parsed. A simple benchmark is available here.

Overall file size is slightly reduced (70 bytes less unminified; probably a bit more when minified, since the comments I added get stripped).

@jdanyow @EisenbergEffect Is this viable, you think?

@fkleuver force-pushed the new-parser branch 2 times, most recently from 3360a81 to 51f7956 on April 24, 2018 03:39
@EisenbergEffect (Contributor):

Wow! Very cool! @jdanyow and @bigopon, can you take a look at this?

@bigopon (Member) commented Apr 24, 2018:

I like this, especially the binary expression parsing; it's clever (though I'd say the old parser is easier to digest), and the tests look nice. I checked it out and tested it on one of my projects. There are one or two places left to pick for some more gains. It's up to @jdanyow and @EisenbergEffect to decide what to do with Lexer and Token.

@fkleuver (Member, Author) commented Apr 24, 2018:

@bigopon Thanks for taking a look!

> 1 or 2 places to pick for some more gains

Which ones?

I know numeric and unicode parsing can still be optimized quite a bit by processing the character code points and performing direct conversion, rather than using parseInt, parseFloat and regex. I could probably do the numeric parsing. Improving unicode would open Pandora's box though: identifiers should then also be taken along imo, and there's quite a bit of work there.

@Alexander-Taran (Contributor):

awesome "refactoring"

@fkleuver (Member, Author):

Updated number scanning: for normal integers it now calculates the value instead of parsing it, which is about 15x faster. For floats there's no difference. Also removed the offset handling from error(). As far as performance goes, that's pretty much all the low-hanging fruit now afaik.
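
For reference, a hedged sketch of what "calculate instead of parse" means for integers; standalone names, not the parser's actual state handling:

```js
// '0' is char code 48 and '9' is 57; accumulate the value digit by digit
// instead of slicing the input and calling parseInt on the substring.
function scanInteger(input, index) {
  let value = 0;
  let code = input.charCodeAt(index);
  while (code >= 48 && code <= 57) {
    value = value * 10 + (code - 48);
    code = input.charCodeAt(++index); // NaN at end of input exits the loop
  }
  return { value, index };
}

scanInteger('42 + foo', 0); // { value: 42, index: 2 }
```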

@fkleuver (Member, Author) commented Apr 26, 2018:

My takeaway from the feedback so far is that it's a nice idea but the readability could be improved, so with the latest two commits I cleaned things up a bit:

  • Wrapped input.charCodeAt in a currentChar getter (was peek in the Lexer)
  • Wrapped index++ in a nextChar() function (was advance() in the Lexer)

And a few more misc things where I reuse functions instead of inlining them; the sketch after this paragraph shows the shape of the two wrappers.
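
A rough illustration; the ParserState container here is hypothetical (in the PR these live on the parser itself):

```js
class ParserState {
  constructor(input) {
    this.input = input;
    this.index = 0;
  }
  // replaces the Lexer's peek: read the current char code without consuming
  get currentChar() {
    return this.input.charCodeAt(this.index);
  }
  // replaces the Lexer's advance: move forward and read the next char code
  nextChar() {
    return this.input.charCodeAt(++this.index);
  }
}
```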

The inlining didn't give nearly as big a performance boost as the switch from string tokens to bitwise tokens. Runtime of the mini benchmark went from ~116ms to ~124ms (compared to ~360ms without any of the changes), so I'd argue a few percent of performance loss is probably worth the gain in readability/maintainability.

Regarding splitting the parser/lexer: I kind of had to throw everything on one big pile first so I could see the whole picture. The main point was to parse while scanning (instead of scanning everything and then parsing), as this saves some memory and array operations.
The classes don't really need to be separate. If you guys prefer it, I could probably split them up again without any significant performance loss. That's up to you guys now.

@fkleuver (Member, Author):

Side note: the runtime could probably also be sped up a little by using the numeric tokens instead of strings in Binary.evaluate (potentially with a function lookup table instead of a switch). The downside is that during debugging you see numbers instead of strings like '===' when you inspect the AST. Opinions, @bigopon @jdanyow @EisenbergEffect?
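
To illustrate the idea, a sketch of the lookup-table variant using the hypothetical numeric tokens from the earlier sketch; the shipped code still switches on operator strings:

```js
// Numeric tokens as keys into a table of tiny evaluator functions.
const evaluators = {
  [T_Add]: (left, right) => left + right,
  [T_Multiply]: (left, right) => left * right
};

// Instead of: switch (this.operation) { case '+': ...; case '===': ...; }
function evaluateBinary(opToken, left, right) {
  return evaluators[opToken](left, right);
}

evaluateBinary(T_Add, 2, 3); // 5
```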

@EisenbergEffect (Contributor):

I'm ok with having it as one class or splitting it. No real hard opinion from me. If it's easier to maintain together, then I think it's fine to keep it as one. I'd like to get another pass from @bigopon on this and maybe some feedback from @jdanyow as well, to make sure there aren't any missed edge cases that for some reason weren't covered by existing tests.

I'm pretty excited about this :)

@jdanyow (Contributor) commented Apr 28, 2018:

This looks awesome, nice work 💪 🚀

My only feedback is around the benchmark: I'd like to see the test cases skewed more towards typical binding use cases, just to confirm we don't have a regression there (I doubt we will). Bindings using AccessScope and AccessMember expressions outnumber other types of expressions by a large factor in typical apps. Could you update the linked benchmark to use at least 4 times as many simple expressions, like fooA, fooB, fooC, ..., fooZ and foo.barA, foo.barB, etc.?

@fkleuver (Member, Author) commented Apr 28, 2018:

@jdanyow Good point. I split up the individual parsing measurements so you can see where the differences are, and added a weight factor just in case, so the Access expressions make up the majority of all parses.

I noticed that my version seems to benefit a lot specifically from the way Chrome optimizes hot paths.

For the first run (page reload -> press test) the results fluctuate between 90% and 140%. There's a regression in parsing floats, but that's to be expected; it's a tradeoff for speeding up integer parsing (which I think is far more common).

For subsequent runs (press test -> press test -> press test) the results stabilize around 180%. That doesn't really help with initial page loads, but it would be interesting to see whether things can be optimized beforehand with SSR / AOT by hinting the browser about where the hot paths are.

Relative results in Edge and FF are somewhat similar for first runs, though Edge is about 3x slower overall and FF about 2x slower. I guess it's mostly array access and switch statement related; Chrome really optimized those.

@jdanyow (Contributor) left a review:

💯

```diff
    args.push(this.parseExpression());
  }

- result = new ValueConverter(result, name, args, [result].concat(args));
+ result = new ValueConverter(result, name, args);
```
@jdanyow (Contributor): nice.

src/parser.js (outdated)

```diff
  constructor() {
-   this.cache = {};
-   this.lexer = new Lexer();
+   this.cache = Object.create(null);
```
@jdanyow (Contributor): 👍

@jdanyow (Contributor) commented Apr 28, 2018:

@fkleuver what are your thoughts on #641? Is there a better way of handling that in the new Parser architecture? Could we handle it out of the box while keeping backwards compat and not adding too much bloat?

@fkleuver (Member, Author) commented Apr 29, 2018:

@jdanyow Unicode shouldn't be an issue as long as we stick to the BMP. Without having to deal with surrogate pairs, it can be as simple as a lookup table (with all the valid identifier code points) consulted in the default switch case in scanToken(). The lookup table can be generated from unicode-10.0.0. I don't see any performance drawbacks there, but it would increase the size of the build.
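
Shape-wise, such a table could be as simple as the following sketch; the single range populated here is an illustrative sample, not the real generated unicode-10.0.0 data:

```js
// One byte per BMP code point, generated at build time from the Unicode
// ID_Continue ranges.
const identifierPart = new Uint8Array(0x10000);
for (let cc = 0x00C0; cc <= 0x00D6; cc++) identifierPart[cc] = 1; // À..Ö

// O(1) check from scanToken's default case; BMP only, no surrogate pairs.
function isIdentifierPart(charCode) {
  return identifierPart[charCode] === 1;
}
```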

@EisenbergEffect (Contributor) commented Apr 29, 2018 via email

@fkleuver (Member, Author):

@EisenbergEffect Approximately 10kb for identifier start and identifier part (compressed).

@EisenbergEffect (Contributor):

I would prefer not to add something like that to the core. Do you think there is some way that it could hook in as a plugin for people who really want it?

@fkleuver (Member, Author) commented Apr 29, 2018:

Sure. We could add something like a ParserOptions class with a ScanUnicodeIdentifier flag/method that gets called if scanToken falls through to the default case.

It doesn't necessarily have to be a standalone plugin btw. As long as the lookup is in a separate file, the build script could output an additional aurelia-binding-unicode file that includes it; users could then specifically include that version instead of the default if they want. I've seen @bigopon do some really neat stuff composing a bundle of only the stuff you need. If that works for templating, it should work for binding too :)

@EisenbergEffect (Contributor):

The trick is that it needs to work with all build types, including require.js with our current CLI.

@fkleuver (Member, Author) commented Apr 29, 2018:

Hm, I took a look at it, and having an alternative dist/../aurelia-binding-unicode.js probably causes more problems than it solves due to the existing hard references in the tooling etc.

I added an initial idea to do this via a simple overridable extension point. Performance impact for the default implementation is negligible as it only does one extra call on the last character of an identifier.

Users could provide their own implementation and register it as the IdentifierValidator themselves, or a plugin could do this. That's probably the best of both worlds; a sketch of the shape is below.
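
The extension point might look something like this; the IdentifierValidator name comes from the description above, while the method name and the registration call are my assumptions:

```js
// Hypothetical default: reject non-ASCII, so scanToken behaves as today.
export class IdentifierValidator {
  // called when scanToken falls through to its default case
  isValidIdentifierPart(charCode) {
    return false;
  }
}

// A unicode-aware implementation could be swapped in through DI, e.g.:
// container.registerSingleton(IdentifierValidator, UnicodeIdentifierValidator);
```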

I'm not 100% sure this is the best name and/or form of encapsulation for the two methods. I was thinking it could be better to have a more generic ...Options/Extensions class injected into the parser, so the signature doesn't need to change if extra options are added in the future.

I'm thinking here of things like auto object creation (#205), debug logging (#30), and globally adding a value converter (I don't recall the issue).

What do you guys think? Have a single ...Options kind of class injected, or just extend with additional constructor arguments if extra stuff is needed in the future?

Or, to keep the iteration size manageable, leave unicode out of scope for this PR and do that separately.

@jdanyow (Contributor) commented Apr 29, 2018:

For sure, out of scope for this PR; I just wanted your thoughts on it. Thanks!

@fkleuver (Member, Author) commented Apr 29, 2018:

Alright, I removed that last commit.

To tie up the remaining loose ends: there are still three // todo comments from the old parser that I wasn't sure what to do with.

  1. // todo(kasperl): Is this really supposed to be expressions?
    I can't think of any expression that wouldn't work as a parameter to behaviors/converters. Maybe cover the various possibilities with tests and then remove the comment?

  2. // todo(kasperl): Check that this is an identifier. Are keywords okay?
    According to the spec they are not, but in practice every browser seems to be fine with identifiers like undefined, null, true, etc. Excluding those would be a breaking change, so I'd say check that it's an identifier, literal, indexed property, or keyword. That should cover all working scenarios, I think?

  3. // todo(kasperl): Stricter checking. Only allow identifiers and strings as keys. Maybe also keywords?
    Same as 2.

(edit: just realized literals and indexed properties are already covered by AccessKeyed)

@fkleuver (Member, Author):

Alright, with the latest commit the todos are gone. Mind double-checking that I addressed them in a desirable manner? Other than that, I think I'm done :)

@jdanyow (Contributor) commented Apr 29, 2018:

Last two commits look good. I think the two kasperl todos were holdovers from the original Angular Dart source for the lexer/parser/AST.

@fkleuver (Member, Author):

Cool, cheers. Let me know if you need anything else for ze merge.

@bigopon (Member) commented Apr 30, 2018:

For extension, we can apply the same strategy as with the Unparser, but it will be opt-in instead of opt-out.

@3cp (Member) commented May 1, 2018:

Damn hardcore! Just want to provide feedback that it all worked in my app.

@fkleuver (Member, Author) commented May 1, 2018:

Nice, thanks!

@stsje commented May 1, 2018:

Will the performance improvement be noticeable in business applications?

@fkleuver (Member, Author) commented May 1, 2018:

@stsje Probably not. The framework overhead is already hardly noticeable; most time is spent rendering, waiting for the network, or executing app logic. Parsing makes up less than 1% of total execution time during the initial load (as do most other processes within Aurelia). The biggest performance improvements are in the hands of app developers.

It can make a difference when you're dynamically generating a lot of HTML and compiling it on the fly, and/or using the computed observation adapter on many getters with large function bodies. That's rare though, and definitely not the goal here.

Firstly, I see this as a proof of concept for applying techniques that may be able to speed up other parts of the framework, slightly decrease initial load times, perform better in benchmarks and so on. Overall it's in the spirit of Aurelia to push performance as much as possible; this is just one small step.

Secondly, I hope this opens the door to implementing a larger portion of the JavaScript syntax for binding, to do more interesting things: this performance gain creates some room for other features that would otherwise not be possible without a performance loss.

@jdanyow (Contributor) commented May 3, 2018:

@fkleuver I was getting ready to merge this and noticed a small issue with one of the patterns used in the parser tests that have try-catch logic. I think these changes need to be made to each try-catch style test:

  describe('does not parse LiteralObject with computed property', () => {
    const expressions = [
      '{ []: "foo" }',
      '{ [42]: "foo" }',
      '{ ["foo"]: "bar" }',
      '{ [foo]: "bar" }'
    ];

    for (const expr of expressions) {
      it(expr, () => {
+       let error = null;
        try {
          parser.parse(expr);
-         pass = true;
        } catch (e) {
-         expect(e.message).toContain('Unexpected token [');
+         error = e;
        }
+       expect(error).not.toBeNull();
+       expect(error.message).toContain('Unexpected token [');
      });
    }
  });

Otherwise, if the parse doesn't throw as expected, the test will still pass.

@fkleuver (Member, Author) commented May 3, 2018:

@jdanyow Glad you caught that; apparently there were a couple of duplicate tests in there as well, and two little issues with the invalid assignment tests. All should be sorted now.

@jdanyow merged commit fd6083f into aurelia:master on May 5, 2018
@jdanyow (Contributor) commented May 5, 2018:

nice work @fkleuver!

@Alexander-Taran (Contributor):

awesome !!!!11
